Abstract: We examine the problem of creating an encoded
distributed storage representation of a data object for a network 
of mobile storage nodes so as to achieve the optimal recovery 
delay. A source node creates a single data object and disseminates 
an encoded representation of it to other nodes for storage, 
subject to a given total storage budget. A data collector node 
subsequently attempts to recover the original data object by 
contacting other nodes and accessing the data stored in them. 
By using an appropriate code, successful recovery is achieved 
when the total amount of data accessed is at least the size of 
the original data object. The goal is to find an allocation of the 
given budget over the nodes that optimizes the recovery delay 
incurred by the data collector; two objectives are considered: 
(i) maximization of the probability of successful recovery by a 
given deadline, and (ii) minimization of the expected recovery 
delay. We solve the problem completely for the second objective 
in the case of symmetric allocations (in which all nonempty nodes 
store the same amount of data), and show that the optimal 
symmetric allocation for the two objectives can be quite different. 
A simple data dissemination and storage protocol for a mobile 
delay-tolerant network is evaluated under various scenarios via 
simulations. Our results show that the choice of storage allocation 
can have a significant impact on the recovery delay performance, 
and that coding may or may not be beneficial depending on the 
circumstances. 

I. Introduction 

Consider a network of n mobile storage nodes. A source 
node creates a single data object of unit size (without loss 
of generality), and disseminates an encoded representation of
it to other nodes for storage, subject to a given total storage
budget $T$. Let $x_i$ be the amount of coded data eventually stored
in node $i \in \{1, \dots, n\}$ at the end of the data dissemination
process. Any amount of data may be stored in each node, as
long as the total amount of storage used over all nodes is at
most the given budget $T$, that is, $\sum_{i=1}^{n} x_i \le T$.

At some time after the completion of the data dissemination 
process, a data collector node begins to recover the original 
data object by contacting other nodes and accessing the data 
stored in them. We make the simplifying assumption that 
the stored data is instantaneously transmitted on contact; this 
approximates the case where there is sufficient bandwidth 
and time for data transmission during each contact. This 
data recovery process continues until the data object can be
recovered from the cumulatively accessed data. Let random
variable $D$ denote the recovery delay incurred by the data
collector, defined as the earliest time at which successful
recovery can occur, measured from the beginning of the data
recovery process. Fig. 1 depicts the information flows in such
a network.

Fig. 1. Information flows originating at the source $s$, some of which finally
arrive at the data collector $t$. Different amounts of coded data may eventually
be stored in each storage node, subject to the given total storage budget $T$.

By using an appropriate code for the data dissemination 
process and eventual storage, successful recovery can be 
achieved when the total amount of data accessed by the data 
collector is at least the size of the original data object. This 
can be accomplished with random linear codes [1], [2] or a
suitable MDS code, for example. Thus, if $\mathbf{r}_d \subseteq \{1, \dots, n\}$ is
the set of all nodes contacted by the data collector by time $d$,
then the recovery delay $D$ can be written as

$$D \triangleq \min\Big\{ d : \sum_{i \in \mathbf{r}_d} x_i \ge 1 \Big\}.$$


Our goal is to find a storage allocation $(x_1, \dots, x_n)$ that
produces the optimal recovery delay, subject to the given 
budget constraint. Specifically, we shall examine the following 
two objectives involving the recovery delay D: 

(i) maximization of the probability of successful recovery
by a given deadline $d$, or recovery probability
$P[D \le d]$, and

(ii) minimization of the expected recovery delay $E[D]$.

By solving for the optimal allocation, we will also be able to 
determine whether coding is beneficial for recovery delay. For 
example, uncoded replication would suffice if each nonempty 
node is to store the data object in its entirety (i.e. $x_i \ge 1$ for
all $i \in S$, and $x_i = 0$ for all $i \notin S$, where $S$ is some subset
of $\{1, \dots, n\}$); the data collector would not need to combine
data accessed from different nodes in order to recover the data
object.
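
The definition of the recovery delay can be illustrated with a short sketch. The function name `recovery_delay` and the sample contact times are illustrative assumptions rather than part of the model; the sketch only relies on the stated assumption that stored data is transferred instantaneously on first contact.

```python
import math

def recovery_delay(allocation, contact_times):
    """Earliest time at which the accessed data totals at least 1 (the object size).

    allocation[i]    -- amount x_i of coded data stored in node i
    contact_times[i] -- time at which the data collector first contacts node i
    Returns math.inf if the contacted data never reaches the object size.
    """
    total = 0.0
    # Visit nodes in order of first contact, accumulating the data accessed.
    for t, x in sorted(zip(contact_times, allocation)):
        total += x
        if total >= 1.0 - 1e-12:  # small tolerance for floating-point sums
            return t
    return math.inf

# Hypothetical first-contact times for n = 4 nodes, with budget T = 2.
times = [3.0, 1.0, 4.0, 2.0]
replication = [1.0, 1.0, 0.0, 0.0]  # two full copies, no coding needed
spreading = [0.5, 0.5, 0.5, 0.5]    # coded data spread over all nodes

print(recovery_delay(replication, times))  # the node met at time 1.0 holds a full copy
print(recovery_delay(spreading, times))    # two contacts are needed
```

With replication, a single contact with a nonempty node suffices; with spreading, several contacts must be combined, which is where coding becomes necessary.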

The nodes of the network are assumed to move around and 
contact each other according to an exogenous random process; 
they are unable to change their trajectories in response to 
the data dissemination or recovery processes. (The recovery 
delay could be improved significantly if nodes were otherwise 
allowed to act on oracular knowledge about future contact 
opportunities [3], for example.)

Most work on delay-tolerant networking traditionally assumes
that the data object is intended for immediate consumption;
both the data dissemination and recovery processes would
therefore begin at the same time, and the recovery delay would 
be measured from the beginning of the data dissemination 
process. In contrast, our model more accurately reflects the 
characteristics of longer-term storage where the data object 
can be consumed long after its creation. Nonetheless, our 
model can still be a good approximation for short-term storage 
especially when the data dissemination process occurs very 
rapidly, as in the case of binary SPRAY-AND-WAIT [4] where
the number of nodes disseminating or spraying data grows 
exponentially over time. 

We also note that in most of the literature involving 
distributed storage, either the data object is assumed to be 
replicated in its entirety (see, e.g., [4]), or, if coding is
used, every node is assumed to store the same amount of
coded data (see, e.g., [5]-[9]). Allocations of a storage
budget with nodes possibly storing different amounts of data 
are not usually considered. 

A. Our Contribution 

This paper attempts to address the gaps in our understanding 
of how the choice of storage allocation can affect the recovery 
delay performance. We formulate a simple analytical model of 
the problem and show that the maximization of the recovery 
probability $P[D \le d]$ can be expressed in terms of the reliability
maximization problem introduced in [10]. It turns out
that the simple strategies of spreading the budget minimally
(i.e. uncoded replication) and spreading the budget maximally
over all $n$ nodes (i.e. assigning $x_i = T/n$ for all $i$) may both be
suboptimal; in fact, the optimal allocation may not even be
symmetric (we say that an allocation is symmetric when all
nonzero $x_i$ are equal). Applying our earlier results in [11], we
can show that minimal spreading is optimal among symmetric 
allocations when the deadline d is sufficiently small, while 
maximal spreading is optimal among symmetric allocations 
when the deadline d is sufficiently large. 

For the minimization of the expected recovery delay E [D], 
we are able to characterize the optimal symmetric allocation 
completely: minimal spreading (i.e. uncoded replication) turns 
out to be optimal whenever the budget T is an integer; 
otherwise, the amount of spreading in the optimal symmetric 
allocation increases with the fractional part of T. 

Interestingly, our analytical results demonstrate that the
optimal symmetric allocation for the two objectives can be
quite different. In particular, when the budget $T$ is an integer,
we observe a phase transition in the optimal symmetric
allocation as the deadline $d$ increases, for the maximization of
recovery probability $P[D \le d]$; however, minimal spreading
(i.e. uncoded replication) alone turns out to be optimal for the
minimization of expected recovery delay $E[D]$.

We proceed to apply our theoretical insights to the design of 
a simple data dissemination and storage protocol for a mobile 
delay-tolerant network. Our protocol generalizes SPRAY-AND-WAIT [4]
by allowing the use of variable-size coded packets.
Using network simulations, we compare the performance of 
different symmetric allocations under various circumstances. 
These simulations allow us to capture the transient dynamics 
of the data dissemination process that were simplified in 
the analytical model. Our main result shows that a maximal 
spreading of the budget is optimal in the high recovery 
probability regime. Specifically, maximal spreading can lead 
to a significant reduction in the wait time required to attain 
a desired recovery probability. We also evaluate the protocol 
against a real-world data set consisting of the mobility traces of 
taxi cabs operating in a city. Besides validating the predictions 
made in our theoretical analysis, these simulations also reveal 
several interesting properties of the allocations under different 
circumstances. 

B. Other Related Work 

Jain et al. [12] and Wang et al. [13] evaluated the delay
performance of symmetric allocations experimentally in the 
context of routing in a delay-tolerant network. Our results 
complement and generalize several aspects of their work. 

We present a theoretical analysis of the problem in Section II
and undertake a simulation study in Section III. Proofs
of theorems are deferred to the appendix.

II. Theoretical Analysis 
We adopt the following notation throughout the paper: 

$n$    total number of storage nodes, $n \ge 2$

$\lambda$    contact rate between any given pair of nodes, $\lambda > 0$

$x_i$    amount of data stored in node $i \in \{1, \dots, n\}$, $x_i \ge 0$

$T$    total storage budget, $1 \le T \le n$

$D$    random variable denoting recovery delay

The indicator function is denoted by $\mathbb{I}[G]$, which equals 1
if statement $G$ is true, and 0 otherwise. We use $B(n, p)$
to denote the binomial random variable with $n$ trials and
success probability $p$. An allocation $(x_1, \dots, x_n)$ is said to
be symmetric when all nonzero $x_i$ are equal; for brevity, let
$\mathbf{x}(n, T, m)$ denote the symmetric allocation for $n$ nodes that
uses a total storage of $T$ and contains exactly $m \in \{1, \dots, n\}$
nonempty nodes, that is,

$$\mathbf{x}(n, T, m) \triangleq \Big( \underbrace{\tfrac{T}{m}, \dots, \tfrac{T}{m}}_{m \text{ terms}}, \underbrace{0, \dots, 0}_{(n-m) \text{ terms}} \Big).$$

The contacts between any given pair of nodes in the network
are assumed to follow a Poisson process with rate parameter
$\lambda$; the time between contacts is therefore
described by an exponential distribution with mean $1/\lambda$. Let
$W_1, \dots, W_n$ be i.i.d. random variables denoting the times at
which the data collector first contacts node $1, \dots, n$, respectively,
where $W_i \sim \text{Exponential}(\lambda)$.

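A. Maximization of Recovery Probability $P[D \le d]$

Let the given recovery deadline be $d > 0$, and let the
subset of nodes contacted by the data collector by time $d$ be
$\mathbf{r} \subseteq \{1, \dots, n\}$. Successful recovery occurs by time $d$ if and
only if the total amount of data stored in the subset $\mathbf{r}$ of nodes
is at least 1. In other words, the recovery delay $D$ is at most $d$
if and only if $\sum_{i \in \mathbf{r}} x_i \ge 1$. Since the data collector contacts
each node by time $d$ independently with constant probability
$p_{\lambda,d}$, given by

$$p_{\lambda,d} \triangleq P[W \le d] = F_W(d) = 1 - e^{-\lambda d},$$

it follows that the probability of contacting exactly a subset
$\mathbf{r}$ of nodes by time $d$ is $p_{\lambda,d}^{|\mathbf{r}|} (1 - p_{\lambda,d})^{n - |\mathbf{r}|}$. The recovery
probability $P[D \le d]$ can therefore be obtained by summing
over all possible subsets $\mathbf{r}$ that allow successful recovery:

$$P[D \le d] = \sum_{\substack{\mathbf{r} \subseteq \{1, \dots, n\}: \\ |\mathbf{r}| \ge 1}} p_{\lambda,d}^{|\mathbf{r}|} (1 - p_{\lambda,d})^{n - |\mathbf{r}|} \cdot \mathbb{I}\Big[ \sum_{i \in \mathbf{r}} x_i \ge 1 \Big]. \tag{1}$$

We seek an optimal allocation $(x_1, \dots, x_n)$ of the budget $T$
(that is, subject to $\sum_{i=1}^{n} x_i \le T$, where $x_i \ge 0$ for all $i$) that
maximizes $P[D \le d]$, for a given choice of $n$, $\lambda$, $d$, and $T$.

This problem matches the reliability maximization problem
of [11] with $p_{\lambda,d}$ as the access probability; we recall that the
optimal allocation may be nonsymmetric and can be difficult to
find. However, if we restrict the optimization to only symmetric
allocations, then we can specify the solution for a wide range
of parameter values of $p_{\lambda,d}$ and $T$. Specifically, if $\lambda$ or $d$ is
sufficiently small (so that $p_{\lambda,d}$ falls below a threshold depending
on $T$), then $\mathbf{x}(n, T, m = \lfloor T \rfloor)$, which
corresponds to a minimal spreading of the budget (i.e. uncoded
replication), is an optimal symmetric allocation. On the other
hand, if $\lambda$ or $d$ is sufficiently large (so that $p_{\lambda,d}$ exceeds a
threshold depending on $T$), then
either $\mathbf{x}(n, T, m = \lfloor \lfloor n/T \rfloor T \rfloor)$ or $\mathbf{x}(n, T, m = n)$, which
correspond to a maximal spreading of the budget, is an optimal
symmetric allocation.

B. Minimization of Expected Recovery Delay $E[D]$

Rewriting (1) in terms of the underlying random variables
gives us the following c.d.f. for the recovery delay $D$:

$$F_D(t) = \sum_{\substack{\mathbf{r} \subseteq \{1, \dots, n\}: \\ |\mathbf{r}| \ge 1}} (F_W(t))^{|\mathbf{r}|} (1 - F_W(t))^{n - |\mathbf{r}|} \cdot \mathbb{I}\Big[ \sum_{i \in \mathbf{r}} x_i \ge 1 \Big].$$

Differentiating $F_D(t)$ with respect to $t$ produces the p.d.f.

$$f_D(t) = f_W(t) \sum_{\substack{\mathbf{r} \subseteq \{1, \dots, n\}: \\ |\mathbf{r}| \ge 1}} \Big( |\mathbf{r}| (F_W(t))^{|\mathbf{r}| - 1} (1 - F_W(t))^{n - |\mathbf{r}|} - (n - |\mathbf{r}|) (F_W(t))^{|\mathbf{r}|} (1 - F_W(t))^{n - |\mathbf{r}| - 1} \Big) \cdot \mathbb{I}\Big[ \sum_{i \in \mathbf{r}} x_i \ge 1 \Big].$$

Therefore, assuming $\sum_{i=1}^{n} x_i \ge 1$, which is necessary for
successful recovery, we can compute the expected recovery
delay as follows:

$$E[D] = \int_0^\infty t f_D(t)\, dt = \frac{1}{\lambda} \Bigg( H_n - \sum_{\substack{\mathbf{r} \subseteq \{1, \dots, n\}: \\ 1 \le |\mathbf{r}| \le n-1}} \frac{1}{(n - |\mathbf{r}|) \binom{n}{|\mathbf{r}|}} \cdot \mathbb{I}\Big[ \sum_{i \in \mathbf{r}} x_i \ge 1 \Big] \Bigg), \tag{2}$$

where $H_n \triangleq \sum_{i=1}^{n} \frac{1}{i}$ is the $n$th harmonic number. We seek
an optimal allocation $(x_1, \dots, x_n)$ of the budget $T$ (that is,
subject to $\sum_{i=1}^{n} x_i \le T$, where $x_i \ge 0$ for all $i$) that minimizes
$E[D]$, for a given choice of $n$, $\lambda$, and $T$. Note that the optimal
allocation is independent of $\lambda$ for the minimization of $E[D]$
but not for the maximization of $P[D \le d]$.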
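
As a rough numerical companion to the expressions for $P[D \le d]$ and $E[D]$, the sketch below evaluates both by brute-force enumeration of subsets. It assumes the closed form $E[D] = \frac{1}{\lambda}\big(H_n - \sum_{1 \le |\mathbf{r}| \le n-1} \mathbb{I}[\sum_{i \in \mathbf{r}} x_i \ge 1] \,/\, ((n-|\mathbf{r}|)\binom{n}{|\mathbf{r}|})\big)$, which follows from integrating $1 - F_D(t)$. The function names are illustrative, and the enumeration is exponential in $n$, so it is only practical for small networks.

```python
from itertools import combinations
from math import comb, exp

def recovery_probability(x, lam, d):
    """P[D <= d]: sum over node subsets whose stored data totals at least 1."""
    n = len(x)
    p = 1.0 - exp(-lam * d)  # probability of contacting a given node by time d
    total = 0.0
    for size in range(1, n + 1):
        for r in combinations(range(n), size):
            if sum(x[i] for i in r) >= 1.0 - 1e-12:
                total += p ** size * (1.0 - p) ** (n - size)
    return total

def expected_delay(x, lam):
    """E[D] via the harmonic-number form; assumes sum(x) >= 1."""
    n = len(x)
    harmonic = sum(1.0 / i for i in range(1, n + 1))
    correction = 0.0
    for size in range(1, n):
        weight = 1.0 / ((n - size) * comb(n, size))
        for r in combinations(range(n), size):
            if sum(x[i] for i in r) >= 1.0 - 1e-12:
                correction += weight
    return (harmonic - correction) / lam

# Example with n = 4, T = 2, lam = 1: replication versus maximal spreading.
print(expected_delay([1.0, 1.0, 0.0, 0.0], 1.0))
print(expected_delay([0.5, 0.5, 0.5, 0.5], 1.0))
```

For a symmetric allocation the result should agree with the mean of the corresponding order statistic of i.i.d. exponential contact times, which gives a convenient sanity check.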

The optimal value of $E[D]$ can be bounded as follows:

Lemma 1. The expected recovery delay $E[D]$ of an optimal
allocation is at least

$$\frac{1}{\lambda} \Bigg( H_n - \sum_{r=1}^{n-1} \frac{1}{n - r} \min\Big( \frac{rT}{n}, 1 \Big) \Bigg).$$
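
The bound of Lemma 1 is cheap to evaluate. The sketch below assumes the bound reads $\frac{1}{\lambda}\big(H_n - \sum_{r=1}^{n-1} \frac{1}{n-r}\min(\frac{rT}{n}, 1)\big)$, which is how it is reconstructed here from the proof in the appendix; `delay_lower_bound` is a hypothetical helper name.

```python
def delay_lower_bound(n, lam, T):
    """Lower bound on E[D] for any allocation of budget T over n nodes.

    Assumed form: (1/lam) * (H_n - sum_{r=1}^{n-1} min(r*T/n, 1) / (n - r)).
    """
    harmonic = sum(1.0 / i for i in range(1, n + 1))
    correction = sum(min(r * T / n, 1.0) / (n - r) for r in range(1, n))
    return (harmonic - correction) / lam

# With T = n every node can hold a full copy, so the bound equals the mean
# time until the first contact with any of the n nodes, 1 / (n * lam).
print(delay_lower_bound(5, 2.0, 5.0))
```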



We make the following conjecture about the optimal allo- 
cation, based on our numerical observations: 

Conjecture. A symmetric optimal allocation always exists for
any $n$, $\lambda$, and $T$.

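As a simplification, we now proceed to restrict the optimization
to only symmetric allocations (which are easier to
describe and implement, and appear to perform well). For the
symmetric allocation $\mathbf{x}(n, T, m)$, successful recovery occurs
by a given deadline $d$ if and only if $\lceil 1 / (T/m) \rceil = \lceil m/T \rceil$ or
more nonempty nodes are contacted by the data collector
by time $d$, out of a total of $m$ nonempty nodes. It follows
that the resulting recovery probability is given by
$P[D \le d] = P[B(m, p_{\lambda,d}) \ge \lceil m/T \rceil]$. We therefore obtain the following
c.d.f. and p.d.f. for the recovery delay $D$:

$$F_D(t) = \sum_{r = \lceil m/T \rceil}^{m} \binom{m}{r} (F_W(t))^{r} (1 - F_W(t))^{m - r},$$

$$f_D(t) = \Big\lceil \frac{m}{T} \Big\rceil \binom{m}{\lceil m/T \rceil} (F_W(t))^{\lceil m/T \rceil - 1} (1 - F_W(t))^{m - \lceil m/T \rceil} f_W(t).$$

Thus, we can compute the expected recovery delay as follows:

$$E[D] = \int_0^\infty t f_D(t)\, dt = \frac{1}{\lambda} \sum_{i=1}^{\lceil m/T \rceil} \frac{1}{m - \lceil m/T \rceil + i} \triangleq E_D(\lambda, T, m).$$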
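
The closed form for symmetric allocations is simple to compute directly. The following is a minimal sketch, assuming $E_D(\lambda, T, m) = \frac{1}{\lambda}\sum_{i=1}^{k} \frac{1}{m - k + i}$ with $k = \lceil m/T \rceil$; the function name is an assumption, and the budget is converted to an exact fraction so that the ceiling is not affected by floating-point rounding.

```python
from fractions import Fraction
from math import ceil

def symmetric_expected_delay(lam, T, m):
    """E_D(lam, T, m) for the symmetric allocation with m nonempty nodes.

    Assumes 1 <= T and 1 <= m, so that k = ceil(m / T) <= m.
    """
    T = Fraction(str(T))       # exact arithmetic for the ceiling below
    k = ceil(Fraction(m) / T)  # number of nonempty nodes that must be contacted
    return sum(1.0 / (m - k + i) for i in range(1, k + 1)) / lam

# T = 1.5: storing 1/2 in each of 3 nodes beats a single node holding 1.5.
print(symmetric_expected_delay(1.0, 1.5, 1))
print(symmetric_expected_delay(1.0, 1.5, 3))
```

The sum is the mean of the $k$th smallest of $m$ i.i.d. exponential contact times, which is why only $m$ and $k$ enter the expression.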

Fig. 2 compares the performance of different symmetric
allocations over different budgets $T$, for an instance of $n$ and
$\lambda$; the value of $m$ corresponding to the optimal symmetric
allocation appears to change in a nontrivial manner as we vary the
budget $T$. Fortunately, we can eliminate many candidates for
the optimal value of $m$ by making the following observation
(a similar observation was made in the maximization of the
recovery probability [11]): For fixed $n$, $\lambda$, and $T$, we have

$$\Big\lceil \frac{m}{T} \Big\rceil = \begin{cases} k & \text{when } m \in ((k-1)T, kT], \text{ for } k = 1, 2, \dots, \lfloor n/T \rfloor, \text{ and finally,} \\ \lfloor n/T \rfloor + 1 & \text{when } m \in (\lfloor n/T \rfloor T, n]. \end{cases}$$

Fig. 2. Plot of expected recovery delay $E[D]$ against budget $T$ for
each symmetric allocation $\mathbf{x}(n, T, m)$, for $n = 20$ and a fixed contact rate $\lambda$. Parameter $m$
denotes the number of nonempty nodes in the symmetric allocation. The black
curve gives a lower bound for the expected recovery delay of an optimal
allocation, as derived in Lemma 1.

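Since $\frac{1}{\lambda} \sum_{i=1}^{k} \frac{1}{m - k + i}$ is decreasing in $m$ for constant $\lambda$ and
$k$, it follows that $E_D(\lambda, T, m)$ is minimized over each of these
intervals of $m$ when we pick $m$ to be the largest integer in the
corresponding interval. Thus, given $n$, $\lambda$, and $T$, we can find
an optimal $m^*$ that minimizes $E_D(\lambda, T, m)$ over all $m$ from
among $\lceil n/T \rceil$ candidates:

$$\big\{ \lfloor T \rfloor, \lfloor 2T \rfloor, \dots, \lfloor \lfloor n/T \rfloor T \rfloor, n \big\}. \tag{3}$$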
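
The reduction to the candidate set can be sketched as follows. The helper names are assumptions, and the candidate set is taken to be $\{\lfloor T \rfloor, \lfloor 2T \rfloor, \dots, \lfloor \lfloor n/T \rfloor T \rfloor, n\}$; the closed form for $E_D$ is repeated so that the block runs on its own.

```python
from fractions import Fraction
from math import ceil, floor

def symmetric_expected_delay(lam, T, m):
    """E_D(lam, T, m) = (1/lam) * sum_{i=1}^{k} 1/(m - k + i), k = ceil(m/T)."""
    T = Fraction(str(T))
    k = ceil(Fraction(m) / T)
    return sum(1.0 / (m - k + i) for i in range(1, k + 1)) / lam

def candidate_values(n, T):
    """Largest integer m in each interval on which ceil(m/T) is constant."""
    T = Fraction(str(T))
    cands = [floor(k * T) for k in range(1, floor(Fraction(n) / T) + 1)]
    if n not in cands:
        cands.append(n)
    return cands

def best_symmetric_m(n, lam, T):
    """Candidate m with the smallest expected recovery delay."""
    return min(candidate_values(n, T),
               key=lambda m: symmetric_expected_delay(lam, T, m))

print(candidate_values(20, 1.5))
print(best_symmetric_m(20, 1.0, 1.5))
```

Only about $n/T$ values of $m$ are examined instead of all $n$, and the contact rate scales every candidate equally, so it does not affect which one is selected.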



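Note that when $m = \lfloor kT \rfloor$, $k \in \mathbb{Z}^+$, the expected recovery
delay simplifies to the following expression:

$$E_D(\lambda, T, m = \lfloor kT \rfloor) = \frac{1}{\lambda} \sum_{i=1}^{k} \frac{1}{\lfloor kT \rfloor - k + i}.$$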



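By further eliminating suboptimal candidate values for $m^*$
using suitable bounds for the harmonic number, we are able to
completely characterize the optimal symmetric allocation for
any $n$, $\lambda$, and $T$:

Theorem 1. Suppose $T = a + 1 - \frac{1}{\ell}$, where $a \in \mathbb{Z}^+$, $\ell \ge 1$.
If $\lfloor \ell \rfloor \le \lfloor n/T \rfloor$, then
$\mathbf{x}(n, T, m = \lfloor \lfloor \ell \rfloor T \rfloor)$
is an optimal symmetric allocation; if $\lfloor \ell \rfloor > \lfloor n/T \rfloor$, then
either $\mathbf{x}(n, T, m = \lfloor \lfloor n/T \rfloor T \rfloor)$ or $\mathbf{x}(n, T, m = n)$
is an optimal symmetric allocation.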
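
Theorem 1 can be checked numerically against an exhaustive search over $m$. The sketch below assumes the reading of the theorem given here, namely $T = a + 1 - 1/\ell$ with the case split on $\lfloor \ell \rfloor \le \lfloor n/T \rfloor$; the function names are hypothetical.

```python
from fractions import Fraction
from math import ceil, floor

def symmetric_expected_delay(lam, T, m):
    """E_D(lam, T, m) = (1/lam) * sum_{i=1}^{k} 1/(m - k + i), k = ceil(m/T)."""
    T = Fraction(str(T))
    k = ceil(Fraction(m) / T)
    return sum(1.0 / (m - k + i) for i in range(1, k + 1)) / lam

def theorem1_candidates(n, T):
    """Values of m singled out by Theorem 1 (as read here) for budget T."""
    T = Fraction(str(T))
    frac = T - floor(T)
    # Write T = a + 1 - 1/ell; ell = 1 exactly when T is an integer.
    ell = Fraction(1) if frac == 0 else 1 / (1 - frac)
    k_max = floor(Fraction(n) / T)
    if floor(ell) <= k_max:
        return [floor(floor(ell) * T)]
    return sorted({floor(k_max * T), n})

def brute_force_best_delay(n, lam, T):
    """Smallest E_D over every symmetric allocation, m = 1, ..., n."""
    return min(symmetric_expected_delay(lam, T, m) for m in range(1, n + 1))

print(theorem1_candidates(20, 1.5))   # fractional part 1/2, so ell = 2
print(theorem1_candidates(20, 1.95))  # ell = 20 exceeds floor(n/T) = 10
```

Comparing `brute_force_best_delay` with the delay of the predicted candidates over a grid of budgets is a quick way to spot transcription errors in the theorem statement.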

If the budget $T$ is an integer (i.e. $\ell = 1$), then $\lfloor \ell \rfloor \le \lfloor n/T \rfloor$ is
always true, and so $\mathbf{x}(n, T, m = \lfloor T \rfloor)$, which corresponds to a
minimal spreading of the budget (i.e. uncoded replication), is
an optimal symmetric allocation. However, if the budget $T$ is
not an integer (i.e. $\ell > 1$), then the amount of spreading in the
optimal symmetric allocation increases with the fractional part
of $T$, up to a point at which either $\mathbf{x}(n, T, m = \lfloor \lfloor n/T \rfloor T \rfloor)$ or
$\mathbf{x}(n, T, m = n)$, which correspond to a maximal spreading of
the budget, becomes optimal. Minimal spreading (i.e. uncoded
replication) therefore performs well over the whole range of
budgets $T$, being optimal among symmetric allocations whenever
$T$ is an integer (its suboptimality at noninteger $T = T_0$
can be bounded by the step difference in $E_D(\lambda, T, m = \lfloor T \rfloor)$
between $T = T_0$ and $T = \lceil T_0 \rceil$, since $E_D(\lambda, T, m)$ is a
nonincreasing function of $T$).

In summary, we note that the optimal symmetric allocation 
for the two objectives can be quite different. In particular, 
when the budget T is an integer, we observe a phase transition 
from a regime where minimal spreading is optimal to a 
regime where maximal spreading is optimal, as the deadline 
$d$ increases, for the maximization of recovery probability
$P[D \le d]$; however, with the averaging over both regimes,
minimal spreading (i.e. uncoded replication) alone turns out to
be optimal for the minimization of expected recovery delay
$E[D]$.

III. Simulation Study 

We apply our theoretical insights to the design of a simple
data dissemination and storage protocol for a mobile delay-tolerant
network. Our protocol extends SPRAY-AND-WAIT [4]
by allowing nodes to store coded packets that are each $1/w$
the size of the original data object, where parameter $w$ is
a positive integer; successful recovery occurs when the data 
collector accesses at least w such packets. Different symmetric 
allocations of the given total storage budget T can be realized 
by choosing different values of w; the original protocol, which 
uses uncoded replication, corresponds to w = 1. 

A. Protocol Description 

The source node begins with a total storage budget of T 
times the size of the original data object, which translates to 
$wT$ coded packets, each $1/w$ the size of the original data object.
Whenever a node with more than one packet contacts another 
node without any packets, the former gives half its packets to 
the latter. The actual amount of data stored or transmitted by 
a node never exceeds the size of the original data object (or 
w packets) since the excess packets can always be generated 
on demand (using random linear coding, for example). To 
reduce the total transmission cost incurred, a node can also 
directly transmit one packet to each node it meets when it 
has w or fewer packets left; otherwise, these last few packets 
would be transmitted multiple times by different nodes. The 
dissemination process is completed when no node has more 
than one packet. 
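
The packet-halving rule described above can be sketched as follows. This is only an illustration under stated assumptions: the packet count $wT$ is an integer, "half" is rounded down for the packets handed over, the contact sequence is supplied as a list of node pairs, and the transmission-saving shortcut for nodes with $w$ or fewer packets is omitted. The names `disseminate` and `stored_fraction` are hypothetical.

```python
def disseminate(num_nodes, source, total_packets, contacts):
    """Apply the halving rule over a sequence of pairwise contacts.

    Returns the number of packets each node is responsible for afterwards.
    """
    packets = [0] * num_nodes
    packets[source] = total_packets
    for a, b in contacts:
        # A contact is symmetric, so try both orientations.
        for giver, taker in ((a, b), (b, a)):
            if packets[giver] > 1 and packets[taker] == 0:
                handed = packets[giver] // 2  # assumed rounding: floor
                packets[giver] -= handed
                packets[taker] = handed
                break
    return packets

def stored_fraction(packet_count, w):
    """Data physically stored by a node, in units of the object size.

    A node never needs more than w packets (one full object), since any
    excess can be regenerated on demand.
    """
    return min(packet_count, w) / w

# w = 2 and T = 2 give wT = 4 packets, each half the object size.
print(disseminate(4, 0, 4, [(0, 1), (1, 2), (0, 3), (2, 3)]))
```

Dissemination is complete once every entry of the returned list is at most 1, which matches the termination condition stated above.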

B. Network Model and Simulation Setup 

Fig. 3. (Random Waypoint) Plots of required wait time $d(P_S)$ against desired recovery probability $P_S$ (semilogarithmic-scale), for budgets $T = 5, 10, 20$.
Each colored line represents a specific choice of parameter $w \in \{1, \dots, n/T\}$, with $w = 1$ (darkest) corresponding to a minimal spreading of the budget
(i.e. uncoded replication), and $w = n/T$ (lightest) corresponding to a maximal spreading of the budget. The mean recovery delay corresponding to each line is
indicated by a square marker.

Fig. 4. (Mobility Traces) Plots of required wait time in minutes $d(P_S)$ against desired recovery probability $P_S$ (semilogarithmic-scale), for budgets
$T = 5, 10, 20$. Each colored line represents a specific choice of parameter $w \in \{1, \dots, n/T\}$, with $w = 1$ (darkest) corresponding to a minimal spreading of
the budget (i.e. uncoded replication), and $w = n/T$ (lightest) corresponding to a maximal spreading of the budget. The mean recovery delay corresponding to
each line is indicated by a square marker.

We implemented a discrete-time simulation of $n = 100$
wireless mobile nodes in a 1000 × 1000 grid. A random waypoint
mobility model is assumed where at each time step, each
node moves a random distance $L \sim \text{Uniform}[5, 10]$ towards
a selected destination; on arrival, the node selects a random
point on the grid as its next destination. Each node has a
communication range of 20, and the bandwidth of each
point-to-point link is large enough to support the transmission of
$w$ packets in one time step. At each time step, a maximal
number of transmissions are randomly scheduled such that
each node can transmit to or receive from at most one other
node in range, and exactly one node may transmit in the range
of a node receiving a transmission. In addition to this baseline
scenario, we also considered the following two scenarios:

(i) a high-mobility scenario, where the distance traveled 
by each node is increased to L ~ Uniform[25,50], and 

(ii) a high-connectivity scenario, where the communication 
range is increased to 80. 
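
One time step of the random waypoint model described above might look as follows. This is a sketch under assumptions not stated in the text: a node that would overshoot its destination stops exactly at it, the next destination is drawn uniformly over the grid, and range is measured by Euclidean distance. The function names are illustrative.

```python
import math
import random

def waypoint_step(position, destination, rng, grid=1000.0,
                  min_step=5.0, max_step=10.0):
    """Advance one node by one time step.

    Returns (new_position, new_destination). The defaults correspond to the
    baseline scenario; the high-mobility scenario would use 25.0 and 50.0.
    """
    x, y = position
    dx, dy = destination[0] - x, destination[1] - y
    dist = math.hypot(dx, dy)
    step = rng.uniform(min_step, max_step)  # L ~ Uniform[min_step, max_step]
    if step >= dist:
        # Arrived: stop at the destination and select the next one at random.
        next_destination = (rng.uniform(0.0, grid), rng.uniform(0.0, grid))
        return destination, next_destination
    scale = step / dist
    return (x + dx * scale, y + dy * scale), destination

def in_range(p, q, comm_range=20.0):
    """True if two nodes are within communication range of each other."""
    return math.dist(p, q) <= comm_range

rng = random.Random(0)
print(waypoint_step((0.0, 0.0), (100.0, 0.0), rng))
```

Transmission scheduling and interference are not modeled in this sketch; it covers only the mobility and range test.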

We measured the recovery delay incurred by the data collector 
for two cases: 

(i) when the data recovery process begins at time 0, i.e. at 
the beginning of the data dissemination process, and 

(ii) when the data recovery process begins at time 2000, 
i.e. when the data dissemination process is already 
underway or completed. (This is a more appropriate 
performance metric for longer-term storage.) 

We ran the simulation 500 times for each choice of budget 
$T \in \{5, 10, 20\}$ and parameter $w \in \{1, 2, \dots, n/T\}$ under each
scenario, with a random pair of nodes appointed as the source 
and data collector for each run. 

C. Simulation Results 

Fig. 3 shows how the required wait time $d(P_S)$, given by

$$d(P_S) \triangleq \min\{ d : P[D \le d] \ge P_S \},$$

varies with the desired recovery probability $P_S$ for each choice
of parameter $w$; these plots essentially describe how much
time must elapse before a desired percentage of data collectors
are able to recover the data object. The recovery probability
performance of the protocol (which can be inferred by flipping
the axes) is mostly consistent with our analysis in Section II-A;
specifically, the phase transition in the optimal symmetric
allocation is clearly discernible in most of the plots. The expected
recovery delay performance is also mostly consistent with our
analysis in Section II-B, with minimal spreading of the budget
($w = 1$) being optimal in most of the plots.
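
The required wait time $d(P_S) = \min\{d : P[D \le d] \ge P_S\}$ can be estimated from simulated recovery delays as an empirical quantile. The sketch below is one plausible way to do so and is not taken from the simulation code; `required_wait_time` is a hypothetical name, and runs in which recovery never occurs are assumed to be recorded as `math.inf`.

```python
import math

def required_wait_time(delays, target_probability):
    """Smallest d such that the empirical P[D <= d] is at least the target."""
    if not delays:
        raise ValueError("at least one simulated delay is required")
    if not 0.0 < target_probability <= 1.0:
        raise ValueError("target_probability must lie in (0, 1]")
    ordered = sorted(delays)
    # Number of runs that must have recovered by time d (with a small
    # tolerance so that exact products are not rounded up by floating point).
    needed = max(1, math.ceil(target_probability * len(ordered) - 1e-9))
    return ordered[needed - 1]

sample = [5.0, 1.0, 3.0, 2.0, 4.0]  # hypothetical recovery delays from 5 runs
print(required_wait_time(sample, 0.6))
print(required_wait_time(sample, 0.99))
```

At high target probabilities the estimate is driven by the slowest few runs, so it is sensitive to the number of simulation runs.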

The plots for the high-mobility scenario appear to be 
vertically scaled versions of the plots for the baseline scenario. 
This is not surprising because an increase in node mobility 
approximately translates to a speeding up of time. The effect 
of increasing node connectivity, on the other hand, seems less 
straightforward: the phase transition in the optimal symmetric 
allocation is evident for recovery starting at time 2000 but not 
for recovery starting at time 0. This discrepancy suggests that 
the data dissemination process is somewhat impeded by the 
increased connectivity, possibly due to greater interference. 

We observe that in the high recovery probability regime,
maximal spreading of the budget ($w = n/T$) can lead to a
significant reduction in the required wait time. For example,
given a budget of T = 10 and a desired recovery probability 
of Ps = 0.99, choosing maximal spreading (w = 10) instead 
of minimal spreading or uncoded replication (w = 1) can yield 
a reduction of 40% to 60% in the required wait time for the 
baseline and high-mobility scenarios. 

We also observe that the recovery start time appears to 
have a limited impact on how the different allocations perform 
relative to each other; the most noticeable effect of starting 
recovery at time 0 is the reduced spread in performance
across different choices of parameter w, especially in the low 
recovery probability regime. This can be explained by the sim- 
ilarity of the different allocations during the data dissemination 
process: in the beginning, the different choices of parameter 
w would see the same allocation of the budget over the nodes 
because only a few nodes have been reached by the source 
directly or indirectly through relays; the different allocations 
are eventually realized only after a sufficient amount of time 
has passed. 

D. Evaluation on Mobility Traces 

To gain a better understanding of how our protocol 
might perform in a real-world setting, we evaluated it on 
a CRAWDAD data set comprising mobility traces of taxi 
cabs in San Francisco [14]. The traces of 100 randomly
selected cabs with GPS coordinate readings over the span of 
an 18-day period were used. The GPS readings were sampled 
at approximately 60-second intervals; because reading times 
were not synchronized across cabs, we estimated the position 
of a cab at any given time using linear interpolation. For better 
accuracy, we assumed that a cab became inactive whenever the 
time between consecutive readings exceeded 2 minutes. As in 
the preceding simulations, we considered different scenarios 
and data recovery start times. Two scenarios were considered 
here: 

(i) a baseline scenario, where the communication range of 
each cab is 20 m, and 

(ii) a high-connectivity scenario, where the communication 
range is increased to 80 m. 

We measured the recovery delay incurred by the data collector 
for two cases: 

(i) when the data recovery process begins on day 1, and 

(ii) when the data recovery process begins on day 10, 
i.e. half-way through the 18-day period. 

We ran the simulation 500 times for each choice of budget 
$T \in \{5, 10, 20\}$ and parameter $w \in \{1, 2, \dots, n/T\}$ under each
scenario, with a random pair of cabs appointed as the source 
and data collector for each run. 
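
The position estimate used for the traces (linear interpolation between consecutive GPS readings, with a cab treated as inactive across gaps longer than 2 minutes) can be sketched as below. The record layout `(time_s, lat, lon)` and the function name are assumptions made for illustration, and interpolating latitude and longitude linearly is itself an approximation that is reasonable over short distances.

```python
from bisect import bisect_right

def interpolate_position(readings, t, max_gap=120.0):
    """Estimate a cab's (lat, lon) at time t, or None if it is inactive.

    readings -- list of (time_s, lat, lon) tuples sorted by time
    max_gap  -- gaps between consecutive readings longer than this many
                seconds mark the cab as inactive (2 minutes in the text)
    """
    times = [r[0] for r in readings]
    j = bisect_right(times, t)  # index of the first reading strictly after t
    if j == 0:
        return None             # before the first reading
    t0, lat0, lon0 = readings[j - 1]
    if t0 == t:
        return (lat0, lon0)     # an exact reading is available
    if j == len(readings):
        return None             # after the last reading
    t1, lat1, lon1 = readings[j]
    if t1 - t0 > max_gap:
        return None             # long gap: cab treated as inactive
    f = (t - t0) / (t1 - t0)
    return (lat0 + f * (lat1 - lat0), lon0 + f * (lon1 - lon0))

trace = [(0.0, 0.0, 0.0), (60.0, 1.0, 2.0), (300.0, 5.0, 5.0)]  # made-up data
print(interpolate_position(trace, 30.0))
print(interpolate_position(trace, 100.0))
```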

Fig. 4 shows how the required wait time $d(P_S)$ varies
with the desired recovery probability $P_S$ for each choice of
parameter $w$. Compared to the plots of Fig. 3 for the random
waypoint simulations, these plots exhibit distinct "jumps" in
the wait times, which can be attributed to the reduced mobility
of the cabs at night. Despite these nonideal conditions, many
of the observations made for the previous simulations are
still applicable here. For instance, the phase transition in the
optimal symmetric allocation is discernible in most of the
plots for the baseline scenario. Also, starting recovery on
day 1 has the effect of reducing the spread in performance
across different choices of parameter $w$, especially in the low
recovery probability regime.

Once again, we observe that in the high recovery probability 
regime, maximal spreading of the budget ($w = n/T$) can lead to
a significant reduction in the required wait time. For example,
given a budget of $T = 10$ and a desired recovery probability
of $P_S = 0.99$, choosing maximal spreading ($w = 10$) instead
of minimal spreading or uncoded replication (w = 1) can yield 
a reduction of 30% to 50% in the required wait time for the 
baseline scenario. 

IV. Conclusion 

We examined the recovery delay performance of different 
distributed storage allocations for a network of mobile storage 
nodes. Our theoretical analysis and simulation study show that 
the choice of objective function (i.e. recovery probability vs 
expected recovery delay) can lead to very different optimal 
symmetric allocations, and that picking the right allocation 
for the given circumstances can make a significant difference 
in performance. 

The work in this paper can be extended in several directions. 
The simple contact model assumed here can be generalized to 
the case where a variable amount of data is transmitted during 
each contact between nodes. Another natural generalization is 
to allow nonuniform contact rates $\lambda_i$ between the data collector
and individual nodes. 

Appendix 
Proofs of Theorems 

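Proof of Lemma 1: Consider a feasible allocation
$(x_1, \dots, x_n)$; we have $\sum_{i=1}^{n} x_i \le T$, where $x_i \ge 0$,
$i = 1, \dots, n$. Let $S_r$ denote the number of $r$-subsets of
$\{x_1, \dots, x_n\}$ that have a sum of at least 1, where
$r \in \{1, \dots, n\}$. Recall from Lemma 1 in [11] that $S_r$ can be
bounded as follows:

$$S_r \le \min\Bigg( \binom{n-1}{r-1} T, \binom{n}{r} \Bigg).$$

We can now rewrite (2) in terms of $S_r$ by enumerating subsets
according to size:

$$\begin{aligned}
E[D] &= \frac{1}{\lambda} \Bigg( H_n - \sum_{r=1}^{n-1} \frac{S_r}{(n-r) \binom{n}{r}} \Bigg) \\
&\ge \frac{1}{\lambda} \Bigg( H_n - \sum_{r=1}^{n-1} \frac{\min\Big( \binom{n-1}{r-1} T, \binom{n}{r} \Big)}{(n-r) \binom{n}{r}} \Bigg) \\
&= \frac{1}{\lambda} \Bigg( H_n - \sum_{r=1}^{n-1} \frac{1}{n-r} \min\Big( \frac{rT}{n}, 1 \Big) \Bigg). \qquad \blacksquare
\end{aligned}$$
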


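Proof of Theorem 1: Suppose $T = a + 1 - \frac{1}{\ell}$, where
$a \in \mathbb{Z}^+$, $\ell \ge 1$. Since $kT = (a+1)k - \frac{k}{\ell}$, the expected
recovery delay for the symmetric allocation $\mathbf{x}(n, T, m = \lfloor kT \rfloor)$,
where $k \in \mathbb{Z}^+$, can be written as

$$E_D(\lambda, T, m = \lfloor kT \rfloor) = \frac{1}{\lambda} \sum_{i=1}^{k} \frac{1}{(a+1)k - \lceil k/\ell \rceil - k + i} = \frac{1}{\lambda} \sum_{i=1}^{k} \frac{1}{ak - \lceil k/\ell \rceil + i}.$$

Observe that $\lceil k/\ell \rceil = v$ when $k \in ((v-1)\ell, v\ell]$, for
$v = 1, 2, \dots$. To compare $E_D(\lambda, T, m = \lfloor kT \rfloor)$ within
each of these intervals of $k$, we introduce Lemma 2.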

Lemma 2. For $a, v, k \in \mathbb{Z}^+$, $k \ge \frac{v}{a}$, the function

$$f(a, v, k) \triangleq \sum_{i=1}^{k} \frac{1}{ak - v + i} = H_{ak - v + k} - H_{ak - v}$$

decreases with $k$.
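
Lemma 2 can be spot-checked with exact rational arithmetic. The sketch below assumes the condition on $k$ reads $k \ge v/a$ (so that every denominator is positive); the helper names are illustrative, and a finite check of this kind supports the statement without proving it.

```python
from fractions import Fraction

def f(a, v, k):
    """f(a, v, k) = sum_{i=1}^{k} 1/(a*k - v + i), computed exactly."""
    return sum(Fraction(1, a * k - v + i) for i in range(1, k + 1))

def is_decreasing(a, v, k_max):
    """Check f(a, v, k) > f(a, v, k + 1) for all admissible k below k_max."""
    k_min = -(-v // a)  # smallest integer k with k >= v / a
    values = [f(a, v, k) for k in range(k_min, k_max + 1)]
    return all(x > y for x, y in zip(values, values[1:]))

print(is_decreasing(1, 1, 30))
print([float(f(2, 1, k)) for k in (1, 2, 3)])
```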

Proof of Lemma 2: Let $\Delta(a, v, k)$ denote the difference
in the function value between consecutive values of $k$, that is,

$$\begin{aligned}
\Delta(a, v, k) &\triangleq f(a, v, k) - f(a, v, k+1) \\
&= (H_{ak-v+k} - H_{ak-v}) - (H_{ak-v+k+a+1} - H_{ak-v+a}) \\
&= (H_{ak-v+a} - H_{ak-v}) - (H_{ak-v+k+a+1} - H_{ak-v+k}) \\
&= \sum_{i=1}^{a} \Big( \frac{1}{ak-v+i} - \frac{1}{ak-v+k+i} \Big) - \frac{1}{ak-v+k+a+1} \\
&= \sum_{i=1}^{a} \frac{k}{(ak-v+i)(ak-v+k+i)} - \frac{1}{ak-v+k+a+1}.
\end{aligned}$$

We will proceed to show that $\Delta(a, v, k) > 0$ for any
$a, v, k \in \mathbb{Z}^+$, $k \ge \frac{v}{a}$. First, we find a lower bound for the
summation term using a geometrical argument. Consider the
function

$$g(t) \triangleq \frac{k}{(ak-v+t)(ak-v+k+t)},$$

which has the second derivative

$$g''(t) = \frac{2}{(ak-v+t)^3} - \frac{2}{(ak-v+k+t)^3}.$$

For any $a, v, k \in \mathbb{Z}^+$, $k \ge \frac{v}{a}$, the function $g(t)$ is positive,
decreasing with $t$, and convex (since $g''(t) > 0$), on the
interval $t \in (0, \infty)$. We therefore have the lower bound

$$\sum_{i=1}^{a} \frac{k}{(ak-v+i)(ak-v+k+i)} \ge \int_{1}^{a+1} g(t)\, dt + \frac{g(1) - g(a+1)}{2},$$

which implies that

$$\begin{aligned}
\Delta(a, v, k) \ge{} & \ln\Bigg( \frac{(ak-v+a+1)(ak-v+k+1)}{(ak-v+k+a+1)(ak-v+1)} \Bigg) \\
& + \frac{k}{2(ak-v+1)(ak-v+k+1)} - \frac{k}{2(ak-v+a+1)(ak-v+k+a+1)} \\
& - \frac{1}{ak-v+k+a+1} \triangleq h(a, v, k).
\end{aligned}$$

Now, it suffices to show that $h(a, v, k) > 0$ for any
$a, v, k \in \mathbb{Z}^+$, $k \ge \frac{v}{a}$. This is indeed the case since

$$\lim_{k \to \infty} h(a, v, k) = 0,$$

and the partial derivative $\frac{\partial}{\partial k} h(a, v, k)$, which is given by

$$\frac{a}{2} \Bigg( \frac{2(ak-v+a+1)+1}{(ak-v+a+1)^2} - \frac{2(ak-v+1)+1}{(ak-v+1)^2} \Bigg) + \frac{a+1}{2} \Bigg( \frac{2(ak-v+k+1)+1}{(ak-v+k+1)^2} - \frac{2(ak-v+k+a+1)-1}{(ak-v+k+a+1)^2} \Bigg),$$

can be shown to be negative. ■
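It follows from Lemma 2 that for each $v \in \mathbb{Z}^+$, the expected
recovery delay $E_D(\lambda, T, m = \lfloor kT \rfloor)$ decreases as $k$ takes larger
values in the interval $((v-1)\ell, v\ell]$, that is,

$$\begin{aligned}
E_D(\lambda, T, m = \lfloor (\lfloor (v-1)\ell \rfloor + 1) T \rfloor) &> E_D(\lambda, T, m = \lfloor (\lfloor (v-1)\ell \rfloor + 2) T \rfloor) \\
&> \cdots \\
&> E_D(\lambda, T, m = \lfloor \lfloor v\ell \rfloor T \rfloor).
\end{aligned}$$

We will proceed to show that

$$E_D(\lambda, T, m = \lfloor \lfloor v\ell \rfloor T \rfloor) \ge E_D(\lambda, T, m = \lfloor \lfloor \ell \rfloor T \rfloor)$$

for all $v \in \mathbb{Z}^+$. This is equivalent to showing that

$$\sum_{i=1}^{\lfloor v\ell \rfloor} \frac{1}{a \lfloor v\ell \rfloor - v + i} \ge \sum_{i=1}^{\lfloor \ell \rfloor} \frac{1}{a \lfloor \ell \rfloor - 1 + i}$$

for any $\ell \ge 1$, $a, v \in \mathbb{Z}^+$. According to Lemma 2, we have

$$\sum_{i=1}^{\lfloor v\ell \rfloor} \frac{1}{a \lfloor v\ell \rfloor - v + i} \ge \sum_{i=1}^{v \lfloor \ell \rfloor + v - 1} \frac{1}{a(v \lfloor \ell \rfloor + v - 1) - v + i},$$

since we can substitute $\ell$ with $\lfloor \ell \rfloor + \tau$, where $\tau \in [0, 1)$, which
yields

$$\lfloor v\ell \rfloor = \lfloor v \lfloor \ell \rfloor + v\tau \rfloor = v \lfloor \ell \rfloor + \lfloor v\tau \rfloor \le v \lfloor \ell \rfloor + v - 1.$$

Defining the function

$$f(a, \ell, v) \triangleq \sum_{i=1}^{v\ell + v - 1} \frac{1}{a(v\ell + v - 1) - v + i} = H_{((a+1)(\ell+1)-1)v - (a+1)} - H_{(a(\ell+1)-1)v - a},$$

it therefore suffices to show that

$$f(a, \ell, v) \ge f(a, \ell, v{=}1) \tag{4}$$

for any $a, \ell, v \in \mathbb{Z}^+$.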

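To obtain lower and upper bounds for $f(a, \ell, v)$, we apply
the following bounds for the harmonic number $H_n$, $n \ge 1$
[15]:

$$\ln\Big( n + \frac{1}{2} \Big) + \gamma + \frac{1}{24(n+1)^2} \le H_n \le \ln\Big( n + \frac{1}{2} \Big) + \gamma + \frac{1}{24 n^2},$$

where $\gamma$ is the Euler-Mascheroni constant. Denoting the lower and
upper bounds on $H_n$ by $H_{\mathrm{LB}}(n)$ and $H_{\mathrm{UB}}(n)$, respectively, this
produces the lower bound

$$f_{\mathrm{LB}}(a, \ell, v) \triangleq H_{\mathrm{LB}}\big( ((a+1)(\ell+1)-1)v - (a+1) \big) - H_{\mathrm{UB}}\big( (a(\ell+1)-1)v - a \big),$$

and the upper bound

$$f_{\mathrm{UB}}(a, \ell, v) \triangleq H_{\mathrm{UB}}\big( ((a+1)(\ell+1)-1)v - (a+1) \big) - H_{\mathrm{LB}}\big( (a(\ell+1)-1)v - a \big),$$

for $(a(\ell+1)-1)v - a \ge 1$. The lower bound $f_{\mathrm{LB}}(a, \ell, v)$ is
an increasing function of $v$ for any $a \ge 1$, $\ell \ge 1$, $v \ge 2$, since
the partial derivative $\frac{\partial}{\partial v} f_{\mathrm{LB}}(a, \ell, v)$, which is given by

$$\frac{2(\ell - 1)}{\big( 2((a+1)(\ell+1)-1)v - 2(a+1) + 1 \big) \big( 2(a(\ell+1)-1)v - 2a + 1 \big)} + \frac{a(\ell+1) - 1}{12 \big( (a(\ell+1)-1)v - a \big)^3} - \frac{(a+1)(\ell+1) - 1}{12 \big( ((a+1)(\ell+1)-1)v - a \big)^3},$$

can be shown to be positive. We therefore have

$$f(a, \ell, v) \ge f_{\mathrm{LB}}(a, \ell, v) \ge f_{\mathrm{LB}}(a, \ell, v{=}2)$$

for any $v \ge 2$, $a, \ell, v \in \mathbb{Z}^+$. We now proceed to demonstrate
that $f_{\mathrm{LB}}(a, \ell, v{=}2) \ge f(a, \ell, v{=}1)$.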

For the case $\ell = 1$, consider the function

$$g(a) \triangleq f_{\mathrm{LB}}(a, \ell{=}1, v{=}2) - f(a, \ell{=}1, v{=}1) = \ln\Big( \frac{2a+1}{2a-1} \Big) - \frac{81a^4 - 71a^2 + 16}{a(9a^2 - 4)^2}.$$

It suffices to show that $g(a) \ge 0$ for any $a \ge 1$, which is
indeed the case since

$$\lim_{a \to \infty} g(a) = 0,$$

and the derivative

$$g'(a) = -\frac{621a^6 - 961a^4 + 436a^2 - 64}{a^2 (4a^2 - 1)(9a^2 - 4)^3}$$

is negative.

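For the case $\ell \ge 2$, we consider the function

$$h(a, \ell) \triangleq f_{\mathrm{LB}}(a, \ell, v{=}2) - f_{\mathrm{UB}}(a, \ell, v{=}1),$$

which can be shown to be nonnegative for any $a \ge 1$, $\ell \ge 2$.
It follows that

$$f_{\mathrm{LB}}(a, \ell, v{=}2) \ge f_{\mathrm{UB}}(a, \ell, v{=}1) \ge f(a, \ell, v{=}1)$$

for any $\ell \ge 2$, $a, \ell \in \mathbb{Z}^+$.

Combining these results, we obtain

$$f(a, \ell, v) \ge f_{\mathrm{LB}}(a, \ell, v) \ge f_{\mathrm{LB}}(a, \ell, v{=}2) \ge f(a, \ell, v{=}1)$$

for any $v \ge 2$, $a, \ell, v \in \mathbb{Z}^+$, which gives us inequality (4) as
required. Consequently, we have

$$E_D(\lambda, T, m = \lfloor kT \rfloor) \ge E_D(\lambda, T, m = \lfloor \lfloor \ell \rfloor T \rfloor)$$

for any $k \in \mathbb{Z}^+$. Since

$$E_D(\lambda, T, m = n) \begin{cases} = E_D\big( \lambda, T, m = \lfloor \tfrac{n}{T} T \rfloor \big) & \text{if } \tfrac{n}{T} \in \mathbb{Z}^+, \\ \ge E_D\big( \lambda, T, m = \lfloor (\lfloor n/T \rfloor + 1) T \rfloor \big) & \text{otherwise,} \end{cases}$$

we also have

$$E_D(\lambda, T, m = n) \ge E_D(\lambda, T, m = \lfloor \lfloor \ell \rfloor T \rfloor).$$

Therefore, if $\lfloor \ell \rfloor \le \lfloor n/T \rfloor$, then $\mathbf{x}(n, T, m = \lfloor \lfloor \ell \rfloor T \rfloor)$ is an
optimal symmetric allocation. On the other hand, if $\lfloor \ell \rfloor > \lfloor n/T \rfloor$,
then we can eliminate all but the two largest candidate values
for $m^*$ in (3), since

$$E_D(\lambda, T, m = \lfloor T \rfloor) > E_D(\lambda, T, m = \lfloor 2T \rfloor) > \cdots > E_D(\lambda, T, m = \lfloor \lfloor n/T \rfloor T \rfloor)$$

by Lemma 2. ■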